Submitted by: Susan Bataju
“Chest X-Ray Images (Pneumonia)” from Kaggle was chosen for Lab 2. The dataset contains 5,863 validated chest X-ray images split into two categories (Pneumonia and Normal), taken from pediatric patients one to five years old at Guangzhou Women and Children’s Medical Center, Guangzhou [2]. The dataset was first collected, organized, and analyzed in “Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification” [1] by Kermany, D.; Zhang, K. et al. Images are labeled as (disease)-(randomized patient ID)-(image number for this patient). The Pneumonia category mixes bacterial and viral pneumonia infections.
The dataset was collected to support detecting pneumonia from patients’ chest X-rays. A highly accurate classifier that detects pneumonia from X-ray images could be very valuable to doctors around the world. Its uses range from verifying doctors’ diagnoses and reducing their workload to deployment in remote places with few qualified medical staff.
The medical industry would have a major business interest in such a classifier. Because someone’s life and well-being may depend on the model’s accuracy, it should reach an accuracy above 98%, although the exact requirement depends on the use case. When the model is used at large scale, even a small error rate can have adverse effects on users.
The images vary in size and aspect ratio. To obtain uniformly sized images without distortion (such as stretching when the aspect ratio changes), all images are cropped to a square. If an image is taller than it is wide, equal amounts are cropped from the top and the bottom until the height equals the width; if it is wider than it is tall, equal amounts are cropped from the left and the right. After cropping, every image has an aspect ratio of 1, so it can be scaled without distortion. In general, the cropped image still contains most of the lungs, while the empty margins on the left and right, the head, and the lower part of the image are removed, so the most useful information is retained.
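The cropping rule described above can be sketched with plain NumPy slicing. This is a minimal illustration, not the notebook's own implementation (which uses `PIL.Image.crop` further down); the helper name `center_crop_square` and the synthetic arrays are assumptions made for the example.

```python
import numpy as np

def center_crop_square(image):
    """Crop the longer axis symmetrically so that height equals width."""
    height, width = image.shape[:2]
    side = min(height, width)
    top = (height - side) // 2    # rows removed from the top (0 if the image is wide)
    left = (width - side) // 2    # columns removed from the left (0 if the image is tall)
    return image[top:top + side, left:left + side]

# Synthetic stand-ins for X-ray arrays (not real data).
tall = np.arange(7 * 4).reshape(7, 4)
wide = np.arange(3 * 8).reshape(3, 8)
print(center_crop_square(tall).shape)
print(center_crop_square(wide).shape)
```

Computing the offset once and adding `side` to it guarantees a square result even when the difference between height and width is odd.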
While most of the images are grayscale, about 280 images in the pneumonia set are RGB; those images are converted to grayscale before use.
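One way such a conversion could look is sketched below. The text does not state which conversion was used, so the ITU-R BT.601 luma weights, the helper name `to_grayscale`, and the assumption of 8-bit input are all illustrative choices.

```python
import numpy as np

def to_grayscale(image):
    """Return a 2-D array; images that are already 2-D are passed through unchanged."""
    if image.ndim == 2:
        return image
    # Assumed weights (ITU-R BT.601 luma); the conversion used in the lab is not stated.
    weights = np.array([0.299, 0.587, 0.114])
    gray = image[..., :3].astype(np.float64) @ weights
    return np.rint(gray).astype(np.uint8)

# Synthetic 8-bit RGB image in which every channel is 200 (not real data).
rgb = np.full((2, 3, 3), 200, dtype=np.uint8)
gray = to_grayscale(rgb)
print(gray.shape, gray.dtype)
```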
PCA, DAISY, and Gabor filters are tested as feature extraction methods. The performance of these methods is evaluated with a KNeighborsClassifier.
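As a rough sketch of how one of these feature extractors can be scored with a k-nearest-neighbors classifier, the example below chains scikit-learn's `PCA` and `KNeighborsClassifier`. Only the PCA branch is shown. The synthetic data, the number of components, the number of neighbors, and the 75/25 split are assumptions for illustration, not the settings used in this lab.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for flattened 16x16 images: two classes that differ in
# mean intensity (not real X-ray data).
n_per_class, n_pixels = 60, 16 * 16
X = np.vstack([
    rng.normal(0.3, 0.1, size=(n_per_class, n_pixels)),
    rng.normal(0.7, 0.1, size=(n_per_class, n_pixels)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle, then hold out the last quarter of the samples for evaluation.
order = rng.permutation(len(y))
X, y = X[order], y[order]
split = int(0.75 * len(y))

# PCA reduces each image to 10 components; KNN classifies in that reduced space.
model = make_pipeline(PCA(n_components=10), KNeighborsClassifier(n_neighbors=5))
model.fit(X[:split], y[:split])
accuracy = model.score(X[split:], y[split:])
print(f"held-out accuracy: {accuracy:.2f}")
```

Wrapping both steps in one pipeline means the PCA projection is fitted on the training portion only, so the held-out score is not influenced by the test images.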
import numpy as np
import pandas as pd
import glob
import os
import math
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from PIL import Image
from copy import deepcopy
from skimage.transform import resize
from ipywidgets import widgets # make this interactive!
import seaborn as sns
import copy
#using glob to get a list of all the jpeg files from the dataset.
train_norm = glob.glob("/Users/48391139/Downloads/chest_xray/train/NORMAL/*.jpeg")
train_pnom = glob.glob("/Users/48391139/Downloads/chest_xray/train/PNEUMONIA/*.jpeg")
test_norm = glob.glob("/Users/48391139/Downloads/chest_xray/test/NORMAL/*.jpeg")
test_pnom = glob.glob("/Users/48391139/Downloads/chest_xray/test/PNEUMONIA/*.jpeg")
#read the images using `matplotlib.image.imread` and store each one in a dictionary keyed by its file name (without extension)
normal_image={}
normal_image_test={}
pneumonia_image={}
pneumonia_image_test={}
for img_file in train_norm:
    img = mpimg.imread(img_file)
    normal_image[os.path.splitext(os.path.basename(img_file))[0]]=img
for img_file in test_norm:
    img = mpimg.imread(img_file)
    normal_image_test[os.path.splitext(os.path.basename(img_file))[0]]=img
for img_file in train_pnom:
    img = mpimg.imread(img_file)
    pneumonia_image[os.path.splitext(os.path.basename(img_file))[0]]=img
for img_file in test_pnom:
    img = mpimg.imread(img_file)
    pneumonia_image_test[os.path.splitext(os.path.basename(img_file))[0]]=img
The first few images of the normal and pneumonia sets are shown below. The images clearly differ in size and aspect ratio.
plt.figure(figsize=(6,6))
grid = 4
for count,(name,image) in enumerate(normal_image.items()):
    if count+1 > grid:
        break
    #show the first `grid` images, uncropped, on a 2 x 2 layout
    plt.subplot(int(grid/2), int(grid/2), count+1)
    plt.imshow(image,cmap='gray')
    plt.title(name)
plt.suptitle("Normal")
plt.tight_layout()
plt.show()
plt.figure(figsize=(6,6))
for count,(name,image) in enumerate(pneumonia_image.items()):
    if count+1 > grid:
        break
    #show the first `grid` images, uncropped, on a 2 x 2 layout
    plt.subplot(int(grid/2), int(grid/2), count+1)
    plt.imshow(image,cmap='gray')
    plt.title(name)
plt.suptitle("Pneumonia")
plt.tight_layout()
plt.show()
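The size variation visible in these plots can also be summarized numerically. The sketch below assumes a dictionary of image arrays shaped like `normal_image`; the helper name `summarize_shapes` and the synthetic demo data are made up for illustration.

```python
import numpy as np

def summarize_shapes(images):
    """Return height, width and aspect-ratio (width / height) ranges for a dict of image arrays."""
    heights = np.array([img.shape[0] for img in images.values()])
    widths = np.array([img.shape[1] for img in images.values()])
    ratios = widths / heights
    return {
        "height_range": (int(heights.min()), int(heights.max())),
        "width_range": (int(widths.min()), int(widths.max())),
        "aspect_ratio_range": (float(ratios.min()), float(ratios.max())),
    }

# Synthetic arrays standing in for loaded X-rays (not real data).
demo = {
    "a": np.zeros((100, 150)),
    "b": np.zeros((200, 200)),
    "c": np.zeros((120, 90)),
}
summary = summarize_shapes(demo)
print(summary)
```

With the real data, a call such as `summarize_shapes(normal_image)` would report how far the images are from a common size before cropping.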
Below is a large grid of up to 400 (20 x 20) images from the Normal set. They are cropped as described in the introduction.
plt.figure(figsize=(75,75))
col = 20
row = 20
for count,(name,image) in enumerate(normal_image.items()):
    if count+1 > row*col:
        break
    #image.shape is (rows, columns), so lx is the height and ly is the width
    lx,ly = image.shape[:2]
    offset = abs(ly-lx)//2
    if ly > lx:
        #wider than tall: trim equal amounts from the left and the right
        crop_image = Image.fromarray(image).crop((offset,0,offset+lx,lx))
    elif ly < lx:
        #taller than wide: trim equal amounts from the top and the bottom
        crop_image = Image.fromarray(image).crop((0,offset,ly,offset+ly))
    else:
        crop_image = Image.fromarray(image)
    plt.subplot(row, col, count+1)
    plt.imshow(np.asarray(crop_image),cmap='gray')
    plt.xticks(())
    plt.yticks(())
plt.suptitle("Normal",fontsize='xx-large')
plt.show()